# Zero-Shot Learning
## Nal V40 Sdxl
Nalgotic Dreams is a text-to-image model based on Stable Diffusion XL, specializing in high-quality anime-style images, particularly bright, detailed illustrations of girl characters.
License: Other · Tags: Image Generation, English · Author: John6666 · Downloads: 203 · Likes: 1
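SDXL checkpoints like this one are normally loaded through Diffusers' `StableDiffusionXLPipeline`. A minimal sketch, assuming the repository id `John6666/nal-v40-sdxl` (hypothetical; check the actual model page) and a CUDA GPU:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Repository id is an assumption; substitute the actual repo from the model page.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "John6666/nal-v40-sdxl",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Anime-style checkpoints typically respond well to tag-style prompts.
image = pipe(
    "1girl, bright colors, detailed illustration, masterpiece",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("nal_v40_sample.png")
```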
## Openvision Vit Small Patch8 384
OpenVision is a fully open, cost-effective family of advanced vision encoders for multimodal learning.
License: Apache-2.0 · Tags: Multimodal Fusion · Author: UCSC-VLAA · Downloads: 21 · Likes: 0
## Openvision Vit Tiny Patch8 224
OpenVision is a fully open, cost-effective family of advanced vision encoders for multimodal learning.
License: Apache-2.0 · Tags: Multimodal Fusion · Author: UCSC-VLAA · Downloads: 123 · Likes: 0
## Openvision Vit Tiny Patch16 384
OpenVision is a fully open, cost-effective family of advanced vision encoders for multimodal learning.
License: Apache-2.0 · Tags: Multimodal Fusion · Author: UCSC-VLAA · Downloads: 19 · Likes: 0
## THUDM.GLM 4 32B 0414 GGUF
GLM-4-32B-0414 is a 32-billion-parameter large language model developed by THUDM, suited to a wide range of text generation tasks; this repository packages it in GGUF format.
Tags: Large Language Model · Author: DevQuasar · Downloads: 13.15k · Likes: 5
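GGUF builds like this one target llama.cpp-compatible runtimes. A minimal sketch using the llama-cpp-python bindings; the quantization filename below is an assumption, so substitute whichever .gguf file you actually downloaded from the repository:

```python
from llama_cpp import Llama

# Filename is hypothetical; use the actual quant you downloaded (e.g. Q4_K_M).
llm = Llama(
    model_path="GLM-4-32B-0414.Q4_K_M.gguf",
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if available
)

out = llm("Explain what a GGUF file is in one paragraph.", max_tokens=200)
print(out["choices"][0]["text"])
```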
## PURE
PURE is the first framework to employ a Multimodal Large Language Model (MLLM) as the backbone for solving low-level vision tasks.
Tags: Image Enhancement, Safetensors · Author: nonwhy · Downloads: 326 · Likes: 1
## Qwen.qwen2.5 VL 72B Instruct GGUF
Qwen2.5-VL-72B-Instruct is a large-scale vision-language model developed by the Qwen (Tongyi Qianwen) team, supporting multimodal understanding and generation over images and text; this repository packages it in GGUF format.
Tags: Image-to-Text · Author: DevQuasar · Downloads: 281 · Likes: 0
## Vit So400m Patch16 Siglip 512.v2 Webli
A SigLIP 2-based Vision Transformer for image feature extraction, suitable for multilingual vision-language tasks.
License: Apache-2.0 · Tags: Image Feature Extraction, Transformers · Author: timm · Downloads: 2,766 · Likes: 0
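timm checkpoints follow a predictable naming scheme, so this encoder can most likely be loaded as `vit_so400m_patch16_siglip_512.v2_webli` (the listing title lowercased; verify on the model page). A sketch of feature extraction with the classifier head removed:

```python
import timm
import torch
from PIL import Image

# Model id inferred from the listing title; confirm against the hub page.
model = timm.create_model(
    "vit_so400m_patch16_siglip_512.v2_webli",
    pretrained=True,
    num_classes=0,  # drop the head, return pooled features
)
model.eval()

# Build the preprocessing that matches this checkpoint's training config.
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

img = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    features = model(transform(img).unsqueeze(0))  # shape: (1, embed_dim)
print(features.shape)
```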
## Vit So400m Patch14 Siglip Gap 378.v2 Webli
A SigLIP 2-based Vision Transformer pre-trained on the WebLI dataset, with the attention pooling head removed and global average pooling applied.
License: Apache-2.0 · Tags: Image Classification, Transformers · Author: timm · Downloads: 20 · Likes: 0
## Vit Base Patch16 Siglip 256.v2 Webli
A SigLIP 2-based ViT image encoder for extracting image features, supporting multilingual vision-language tasks.
License: Apache-2.0 · Tags: Image Feature Extraction, Transformers · Author: timm · Downloads: 731 · Likes: 2
## Phi 4 Model Stock V2
Phi-4-Model-Stock-v2 is a large language model merged from multiple Phi-4 variants using the model_stock merging method, showing strong performance across multiple benchmarks.
Tags: Large Language Model, Transformers · Author: bunnycore · Downloads: 56 · Likes: 2
## Videolisa 3.8B
A video language-guided reasoning segmentation model built on LLaVA-Phi-3-mini-4k-instruct, focused on object segmentation in videos.
License: Apache-2.0 · Tags: Video Segmentation, Safetensors, English · Author: ZechenBai · Downloads: 247 · Likes: 6
## Omnigen V1
OmniGen is a unified image generation model that supports a range of image generation tasks.
License: MIT · Tags: Image Generation · Author: Shitao · Downloads: 5,886 · Likes: 309
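OmniGen ships with its own pipeline class rather than a stock Diffusers one. The sketch below assumes the `OmniGen` package and the `OmniGenPipeline` interface described in the project repository; the exact signature may differ between versions, so treat this as an outline and check the README:

```python
# Assumes: pip install from the OmniGen project repository.
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Plain text-to-image; OmniGen also accepts interleaved image inputs
# for editing-style tasks (see the project README).
images = pipe(
    prompt="A watercolor painting of a lighthouse at dusk.",
    height=1024,
    width=1024,
    guidance_scale=2.5,
    seed=0,
)
images[0].save("omnigen_sample.png")
```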
## Lumina Mgpt 7B 768
Lumina-mGPT is a family of multimodal autoregressive models that excels at generating flexible, realistic images from text descriptions and can perform a variety of vision and language tasks.
Tags: Text-to-Image, Transformers · Author: Alpha-VLLM · Downloads: 1,944 · Likes: 33
## Mambavision B 1K
MambaVision-B-1K is a hybrid Mamba-Transformer vision backbone developed by NVIDIA, pretrained on ImageNet-1K for image classification and visual feature extraction.
License: Apache-2.0 · Tags: Image Classification, Transformers · Author: nvidia · Downloads: 1,082 · Likes: 11
## Llama3 Med42 8B
Med42-v2 is a clinically aligned large language model suite developed by M42 on the LLaMA-3 architecture, available in 8B and 70B parameter versions and designed for high-quality medical Q&A.
Tags: Large Language Model, Transformers, English · Author: m42-health · Downloads: 6,755 · Likes: 66
## Llama3 Med42 70B
Med42-v2 is an open-access clinical large language model suite developed by M42, built on LLaMA-3 in 8-billion and 70-billion parameter versions, capable of high-quality medical question answering.
Tags: Large Language Model, Transformers, English · Author: m42-health · Downloads: 11.10k · Likes: 46
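Both Med42-v2 sizes are standard Llama-3-style causal LMs, so they load with the regular transformers APIs. A sketch against the 8B variant, using the tokenizer's chat template; the repo id `m42-health/Llama3-Med42-8B` is assumed from the listing, so verify it on the hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m42-health/Llama3-Med42-8B"  # assumed repo id; check the hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful clinical assistant."},
    {"role": "user", "content": "What are common symptoms of iron deficiency?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```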
## Resnet50 Facial Emotion Recognition
A ResNet-50-based image classification model for recognizing facial emotions.
License: Apache-2.0 · Tags: Image Classification, Transformers · Author: KhaldiAbderrhmane · Downloads: 50 · Likes: 3
## Libra 11b Base
Libra is a decoupled vision system built on large language models, providing fundamental multimodal understanding capabilities.
License: Apache-2.0 · Tags: Image-to-Text, Transformers · Author: YifanXu · Downloads: 18 · Likes: 0
## Yuna Ai V3
Yuna AI is a virtual companion model designed for emotional connection, offering deeper interactive experiences than a traditional assistant.
Tags: Large Language Model, Multilingual · Author: yukiarimo · Downloads: 139 · Likes: 10
## Vitamin XL 256px
ViTamin-XL-256px is a vision-language model based on the ViTamin architecture, designed for efficient visual feature extraction and multimodal tasks, with support for high-resolution image processing.
License: MIT · Tags: Image Feature Extraction, Transformers · Author: jienengchen · Downloads: 655 · Likes: 1
## Moai 7B
MoAI is a large-scale hybrid language-and-vision model that processes both image and text inputs to generate text outputs.
License: MIT · Tags: Image-to-Text, Transformers · Author: BK-Lee · Downloads: 183 · Likes: 45
## Llava Maid 7B DPO GGUF
LLaVA is a large language-and-vision assistant model that handles multimodal tasks involving images and text; this repository provides a DPO-tuned variant in GGUF format.
Tags: Image-to-Text · Author: megaaziib · Downloads: 99 · Likes: 4
## Erasedraw
EraseDraw is a diffusion-based image editing tool that can insert or modify objects in images from text prompts.
License: MIT · Tags: Image Generation · Author: alpercanberk · Downloads: 30 · Likes: 3
## Supermario Slerp V2
supermario-slerp-v2 is a text generation model created by merging two 7B-parameter models with the SLERP method, showing strong results across multiple benchmarks.
License: Apache-2.0 · Tags: Large Language Model, Transformers, English · Author: jan-hq · Downloads: 15 · Likes: 2
## Vit Gpt2 Image Captioning
An image captioning model built on the Vision Encoder-Decoder architecture (ViT encoder, GPT-2 decoder) that generates natural language descriptions for input images.
License: Apache-2.0 · Tags: Image-to-Text, Transformers · Author: baseplate · Downloads: 55 · Likes: 2
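Vision Encoder-Decoder captioners like this are supported directly by the transformers image-to-text pipeline. A sketch, assuming the repo id `baseplate/vit-gpt2-image-captioning` (inferred from the listing; confirm on the model page):

```python
from transformers import pipeline

# Repo id inferred from author and title; verify before use.
captioner = pipeline("image-to-text", model="baseplate/vit-gpt2-image-captioning")

result = captioner("example.jpg")  # accepts a path, URL, or PIL image
print(result[0]["generated_text"])
```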
## Cartoonizer
An instruction-tuned variant of Stable Diffusion v1.5, designed specifically for image cartoonization.
License: MIT · Tags: Image Generation, Other · Author: instruction-tuning-sd · Downloads: 232 · Likes: 76
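The instruction-tuning-sd models follow the InstructPix2Pix recipe, so the natural loading path is Diffusers' `StableDiffusionInstructPix2PixPipeline`. The repo id below is assumed from the listing; check the model card before use:

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Repo id assumed from the listing; confirm on the model card.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "instruction-tuning-sd/cartoonizer", torch_dtype=torch.float16
).to("cuda")

image = Image.open("portrait.jpg").convert("RGB")
edited = pipe("Cartoonize the following image", image=image).images[0]
edited.save("portrait_cartoon.png")
```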
## Bart Ranker
A model that predicts the relevance of query-document pairs, suitable for information retrieval tasks.
License: MIT · Tags: Text Embedding, Transformers · Author: bsl · Downloads: 31 · Likes: 3
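Relevance models of this kind are typically cross-encoders: the query and document are scored jointly as one sequence pair. A generic sketch with the transformers sequence-classification API; the repo id `bsl/bart-ranker` and its label layout are assumptions, so check the model card:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "bsl/bart-ranker"  # assumed repo id; verify on the hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

query = "what causes tides"
doc = "Tides are caused by the gravitational pull of the moon and sun."

# Cross-encoders score the (query, document) pair in a single forward pass.
inputs = tokenizer(query, doc, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits
print(score)
```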
## Chinese Clip Vit Base Patch16
The base version of Chinese CLIP, using ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder, trained on roughly 200 million Chinese image-text pairs.
Tags: Text-to-Image, Transformers · Author: OFA-Sys · Downloads: 49.02k · Likes: 104
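Chinese CLIP has first-class support in transformers via `ChineseCLIPModel` and `ChineseCLIPProcessor`, which makes zero-shot image-text matching straightforward:

```python
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-base-patch16"
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")
texts = ["一只猫", "一只狗", "一辆汽车"]  # candidate captions in Chinese

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over image-text similarity gives zero-shot label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```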
## Rare Puppers3
An image classification model generated with HuggingPics, trained to recognize a specific set of image categories.
Tags: Image Classification, Transformers · Author: Samlit · Downloads: 28 · Likes: 0
## Llama Horse Zebra
An image classification model generated with HuggingPics that accurately distinguishes animals such as horses, llamas, and zebras.
Tags: Image Classification, Transformers · Author: osanseviero · Downloads: 38 · Likes: 0
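HuggingPics classifiers are ordinary ViT image classifiers, so they work with the stock image-classification pipeline. A sketch assuming the repo id `osanseviero/llama-horse-zebra` (inferred from the listing; verify on the hub):

```python
from transformers import pipeline

# Repo id inferred from author and title; confirm before use.
classifier = pipeline("image-classification", model="osanseviero/llama-horse-zebra")

for pred in classifier("animal.jpg"):  # path, URL, or PIL image
    print(f"{pred['label']}: {pred['score']:.3f}")
```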